Linear regression using numpy
Linear Regression¶
Linear Regression is a statistical approach for modelling the relationship between a scalar response (also known as the dependent variable) and one or more explanatory variables (also known as independent variables). It is one of the most basic building blocks of Machine Learning algorithms.
In most Python libraries, the linear regression model is available as a black box. However, it is imperative to understand what goes on under the hood in order to have a better grasp of the algorithm.
This article focuses on implementing vanilla linear regression from scratch using numpy and pandas.
Let's start by importing some basic Python libraries.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler
The notation that we will be using throughout is as follows -
- $X$ : input data
- $Y$ : target labels
- $W$ : weights vector (to be estimated using the training data)
- $b$ : intercept (bias) term
The notation for tracking dimensions is as follows -
- $m$ : number of training examples
- $n$ : number of input features
- $k$ : size of the labels vector (for a single-output problem such as this one, $k = 1$)
Load Diabetes dataset provided by sklearn¶
Diabetes dataset ¶
Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.
Data Set Characteristics:
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information:
- Age
- Sex
- Body mass index
- Average blood pressure
- S1
- S2
- S3
- S4
- S5
- S6
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times n_samples
(i.e. the sum of squares of each column totals 1).
Source URL: http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see: Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499. (http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
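The same description ships with the dataset object itself. As a quick optional check (a minimal sketch, relying only on the scikit-learn import above), we can confirm the shapes before wrapping the loading step in a helper:
# Optional: peek at the raw dataset before building anything on top of it.
diabetes_raw = datasets.load_diabetes()
print(diabetes_raw['data'].shape)     # (442, 10) feature matrix
print(diabetes_raw['target'].shape)   # (442,) disease-progression response
print(diabetes_raw['DESCR'][:200])    # start of the bundled description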
def load_diabetes_data():
    # Load the diabetes dataset and reshape the target into an (m, 1) column vector.
    diabetes = datasets.load_diabetes()
    X, Y = diabetes['data'], diabetes['target'].reshape(-1, 1)
    return X, Y
Helper functions for scaling data.¶
We shall leverage the existing MinMaxScaler provided by scikit-learn.
def get_scale_object(data):
    # Fit a MinMaxScaler so that each feature can be rescaled to the [0, 1] range.
    scaler = MinMaxScaler()
    scaler.fit(data)
    return scaler
def get_scaled_data(scaler, data):
    # Apply a previously fitted scaler to the data.
    return scaler.transform(data)
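As a quick illustration of what these helpers do (a toy example, not part of the diabetes pipeline), MinMaxScaler rescales every column independently to the [0, 1] range:
# Toy check: each column is mapped to [0, 1] independently.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])
toy_scaler = get_scale_object(toy)
print(get_scaled_data(toy_scaler, toy))
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]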
While implementing Linear Regression from scratch, it is important to keep track of dimensions.
For this tutorial, we shall use the following dimensions -
\begin{align} X & = m \times n \tag{1}\label{1} \\ Y & = m \times 1 \tag{2}\label{2} \\ W & = 1 \times n \tag{3}\label{3} \\ b & = 1 \times 1 \tag{4}\label{4} \end{align}
The pseudo-code for training a Linear Regression model is as follows -
Step 1 -¶
Given a training dataset X and the corresponding labels Y, initialize the number of training examples (m), the number of features (n) and the number of output units (k), with dimensions as given in equations \ref{1}, \ref{2}, \ref{3} and \ref{4}.
Step 2 -¶
Initialize Weights matrix W and bias b using dimensions from Step 1.
Step 3 - Forward Propagation -¶
Step 3.1 - Compute the predictions using \ref{5}¶
\begin{align} y_{pred} = h(\theta) = XW^T + b \tag{5}\label{5} \end{align}
Let's validate that the dimensions match -
\begin{equation} dim(y_{pred}) = (m \times n) \times (1 \times n)^\intercal + (1 \times 1) = (m \times 1) \end{equation}
The $(m \times n)$ input multiplied by the transposed $(1 \times n)$ weights gives an $(m \times 1)$ vector, and the $(1 \times 1)$ bias $b$ is broadcast across all $m$ rows.
We are good to move ahead.
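To make the shape bookkeeping concrete, here is a minimal sketch (toy shapes chosen purely for illustration) showing that $XW^T + b$ broadcasts to an $m \times 1$ vector of predictions:
# Toy shape check for the forward pass y_pred = X W^T + b.
m_toy, n_toy = 5, 3
X_toy = np.random.rand(m_toy, n_toy)   # (m, n)
W_toy = np.zeros((1, n_toy))           # (1, n)
b_toy = np.zeros((1, 1))               # (1, 1)
print((np.dot(X_toy, W_toy.T) + b_toy).shape)   # (5, 1), i.e. m x 1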
Step 3.2 - Compute the loss / error / cost using \ref{6}¶
\begin{align} cost = \frac{1}{2m} \sum_{i =1}^{m} (y_{true} - y_{pred})^2 \tag{6} \label{6} \end{align}The above equation is the 'mean squared error', with the conventional factor of $\frac{1}{2}$ that cancels neatly when we differentiate. It is important to note that the cost / error / loss is a scalar which we would like to minimize using gradient descent.
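A tiny worked example (toy numbers, purely illustrative) confirms the cost is a single scalar and that we sum the squared errors rather than squaring the sum:
# Toy cost check with m = 3 predictions vs. targets.
t_true = np.array([[1.0], [2.0], [3.0]])
t_pred = np.array([[1.5], [2.0], [2.0]])
errors = t_pred - t_true                        # [[0.5], [0.0], [-1.0]]
cost_toy = (1.0 / (2 * 3)) * np.sum(errors**2)  # (0.25 + 0.0 + 1.0) / 6
print(cost_toy)                                 # ~0.2083, a scalar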
Step 4 - Back propagation -¶
The back propagation step updates the weights so that the MSE loss is minimized. Let's dig into the nuts and bolts of back propagation.
We shall derive the gradients using the chain rule.
Let's start by computing the partial derivative $\frac{\partial L}{\partial h(\theta)}$ -
\begin{align} \frac{\partial L}{\partial h(\theta)} & = \frac{\partial }{\partial h(\theta)} \Big[ \frac{1}{2m} \sum_{i =1}^{m} (y_{true} - h(\theta))^2 \Big] \end{align}
For one training example $i$ (dropping the $\frac{1}{m}$ factor for now; it is restored when we average over all $m$ examples) -
\begin{align} \frac{\partial L}{\partial h(\theta)} & = \frac{1}{2} \frac{\partial}{\partial h(\theta)} \Big[(y_{true} - h(\theta))^2 \Big]\\ & = \frac{1}{2} \cdot 2 \Big[ y_{true} - h(\theta)\Big] \cdot (-1) \\ & = -\Big[ y_{true} - h(\theta)\Big] \\ & = \Big[h(\theta) - y_{true} \Big] \tag{9} \label{9} \end{align}Now, the last part of this derivation is to compute the gradients with respect to the weights / coefficients.
\begin{align}\\ \frac{\partial L}{\partial W} & = \frac{\partial L}{\partial h(\theta)} \frac{\partial h(\theta)}{\partial W} \\ & = \frac{\partial L}{\partial h(\theta)} \Big[ \frac{\partial h(\theta)}{\partial W_1} \ \frac{\partial h(\theta)}{\partial W_2} \ .... \ \frac{\partial h(\theta)}{\partial W_n}\Big] \end{align}The derivatives $\Big[ \frac{\partial h(\theta)}{\partial W_1} \ \frac{\partial h(\theta)}{\partial W_2} \ .... \ \frac{\partial h(\theta)}{\partial W_n}\Big]$ can be easily calculated as follows -
\begin{align} \\ \frac{\partial h(\theta)}{\partial W_1} & = \frac{\partial }{\partial W_1} \Big(x_1w_1 + x_2w_2 + .... + x_nw_n + b \Big) \\ & = x_1 \end{align}Similarly, \begin{align} \frac{\partial h(\theta)}{\partial W_2} = x_2, \quad \frac{\partial h(\theta)}{\partial W_3} = x_3, \quad ...., \quad \frac{\partial h(\theta)}{\partial W_n} = x_n \end{align}
Now, the gradient of the bias term $b$ is simply - \begin{align}\\ \frac{\partial L}{\partial b} & = \frac{\partial L}{\partial h(\theta)} \frac{\partial h(\theta)}{\partial b} \\ & = (h(\theta) - Y^{(i)}) \frac{\partial }{\partial b} \Big(x_1w_1 + x_2w_2 + .... + x_nw_n + b \Big) \\ & = (h(\theta) - Y^{(i)}) \ (1) \end{align}
Therefore, for one training example, the gradients with respect to the weights and the bias can be concisely computed using -
\begin{align} \\ \frac{\partial L}{\partial W} = (h(\theta)-Y^{(i)}) [x_1^{(i)} x_2^{(i)} x_3^{(i)} .... x_n^{(i)}] \\ \frac{\partial L}{\partial b} = (h(\theta)-Y^{(i)}) \end{align}By extending the computations for one training example to m training examples, we get -
\begin{align} \\ \frac{\partial L}{\partial W} = \frac{1}{m} \sum_{i =1}^{m} \frac{\partial {L^{(i)}}}{\partial W} \tag{13}\label{13} \\ \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i =1}^{m} \frac{\partial {L^{(i)}}}{\partial b} \tag{14}\label{14} \end{align}Equations \ref{13} and \ref{14} can be written in vectorised form as follows -
\begin{align} \\ \frac{\partial L}{\partial W} = \frac{1}{m}(h(\theta) - Y)^T X \tag{15}\label{15} \\ \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i =1}^{m} (h(\theta)^{(i)} - Y^{(i)}) \tag{16}\label{16} \end{align}
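As an optional spot-check of equation \ref{15} (a minimal sketch on toy data, independent of the notebook's pipeline), we can compare the analytical gradient against a central finite-difference approximation:
# Numerical spot-check of dL/dW from equation (15) on toy data.
def toy_cost(W_, b_, X_, Y_):
    preds = np.dot(X_, W_.T) + b_
    return (1.0 / (2 * X_.shape[0])) * np.sum((preds - Y_)**2)

rng = np.random.RandomState(0)
X_t, Y_t = rng.rand(6, 2), rng.rand(6, 1)
W_t, b_t = rng.rand(1, 2), rng.rand(1, 1)

preds_t = np.dot(X_t, W_t.T) + b_t
analytic_dw = (1.0 / X_t.shape[0]) * np.dot((preds_t - Y_t).T, X_t)  # equation (15)

eps = 1e-6
numeric_dw = np.zeros_like(W_t)
for j in range(W_t.shape[1]):
    W_plus, W_minus = W_t.copy(), W_t.copy()
    W_plus[0, j] += eps
    W_minus[0, j] -= eps
    numeric_dw[0, j] = (toy_cost(W_plus, b_t, X_t, Y_t) - toy_cost(W_minus, b_t, X_t, Y_t)) / (2 * eps)

print(np.allclose(analytic_dw, numeric_dw))   # expect True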
Step 4.1 - The weights can be updated using the gradient descent equations below, where $\alpha$ is the learning rate -¶
\begin{align} \\ W = W - \alpha \frac{\partial L}{\partial W} \tag{17} \label{17} \\ b = b - \alpha \frac{\partial L}{\partial b} \tag{18} \label{18} \end{align}Implement Step 1 -¶
raw_X, Y = load_diabetes_data()
scaler = get_scale_object(raw_X)
X = get_scaled_data(scaler, raw_X)
Helper functions for initializing vectors¶
def initialize_dimensions(X):
    # m training examples, n features.
    m, n = X.shape
    return m, n
def initialize_weight_vectors(dim):
    # Start with all-zero weights and bias; gradient descent will move them.
    W = np.zeros(shape=dim)
    b = np.zeros(shape=(1, 1))
    return W, b
Initialize weights vector¶
m, n = initialize_dimensions(X)
W, b = initialize_weight_vectors((1, n))
Functions for computing the cost and the gradients -¶
def compute_regression_cost(y, y_pred, m):
    # Mean squared error cost from equation (6): sum of squared errors, scaled by 1/(2m).
    cost = (1.0 / (2 * m)) * np.sum((y_pred - y)**2)
    return cost
def compute_regression_gradients(y_pred, y, X, m):
    # Vectorised gradients from equations (15) and (16).
    dw = (1.0 / m) * np.dot((y_pred - y).T, X)                    # shape (1, n)
    db = (1.0 / m) * np.sum(y_pred - y, axis=0, keepdims=True)    # shape (1, 1)
    return dw, db
def propagation(w, b, X, Y):
    m = X.shape[0]
    # FORWARD PROPAGATION (FROM X TO COST)
    z = np.dot(X, w.T) + b
    cost = compute_regression_cost(Y, z, m)
    # BACKWARD PROPAGATION (TO FIND GRAD)
    dw, db = compute_regression_gradients(z, Y, X, m)
    cost = np.squeeze(cost)
    grads = {"dw": dw,
             "db": db}
    return grads, cost
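Before moving on to the training loop, a quick smoke test (optional, reusing the X, Y, W and b defined above) confirms the gradients come back with the expected shapes:
# Smoke test: gradients should match the shapes of W and b.
test_grads, test_cost = propagation(W, b, X, Y)
print(test_grads['dw'].shape)   # expect (1, n), i.e. (1, 10)
print(test_grads['db'].shape)   # expect (1, 1)
print(float(test_cost))         # the cost is a single scalar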
def train_linear_regression(X, Y, learning_rate=0.2, n_epochs=4000, print_cost=True):
    m, n = initialize_dimensions(X)
    print("There are %d features and %d training examples" % (n, m))
    w, b = initialize_weight_vectors((1, n))
    print("W has a shape of %d rows and %d columns" % (w.shape[0], w.shape[1]))
    print("b has a shape of %d rows and %d columns" % (b.shape[0], b.shape[1]))
    costs = []
    for iteration in range(n_epochs):
        grad, cost = propagation(w, b, X, Y)
        dw = grad['dw']
        db = grad['db']
        # Gradient descent update (equations 17 and 18)
        w = w - learning_rate * dw
        b = b - learning_rate * db
        # Record the cost every 10 iterations
        if iteration % 10 == 0:
            costs.append(cost)
        # Print the cost and RMSE every 100 iterations
        if print_cost and iteration % 100 == 0:
            y_pred = np.dot(X, w.T) + b
            rmse = np.sqrt(np.mean((y_pred - Y)**2))
            print("Cost after iteration %i: %f | rmse after iteration %i: %f" % (iteration, cost, iteration, rmse))
    params = {"w": w,
              "b": b}
    grads = {"dw": dw,
             "db": db}
    return params, grads, costs
params, _, _ = train_linear_regression(X, Y, print_cost=True)
y_predicted = np.dot(X, params["w"].T) + params["b"]
rmse = np.sqrt(np.mean((y_predicted - Y)**2))
ssr = np.mean((y_predicted - Y)**2)     # mean squared residual
sst = np.mean((np.mean(Y) - Y)**2)      # variance of the targets
R2_score = 1.0 - (ssr / sst)
print("Model R2 score is %f" % R2_score)
Incredible !!¶
We have successfully trained a linear regression model from scratch.